Skip to content

history: per-message cost, tokens, and latency tracking#16

Closed
akrentsel wants to merge 2 commits into
mainfrom
feature/message-cost-tracking
Closed

history: per-message cost, tokens, and latency tracking#16
akrentsel wants to merge 2 commits into
mainfrom
feature/message-cost-tracking

Conversation

@akrentsel

@akrentsel akrentsel commented May 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds an optional UsageRecord to every EventData::Messages event so we have a durable, per-message record of:

  • Model id (echoed by the provider, or the requested binding as fallback)
  • Raw token counts — prompt, completion, cached, cache-creation, reasoning
  • USD cost — computed at call time from the LiteLLM pricing database (loaded at runtime, see below)
  • TTFT + wall-clock duration — measured in the executor on both streaming and non-streaming paths
  • server_duration_ms — reserved (lingua does not yet surface a provider-reported processing time)

Pricing: why runtime LiteLLM JSON, not a hardcoded table

OpenAI and Anthropic standard APIs return tokens but not USD in their per-call responses. (OpenAI never; Anthropic only in a separate aggregate Admin API.) So cost must always be computed downstream.

Initial version hardcoded a small price table in Rust. That was a mistake — by the time I wrote the first commit, my table already had Claude Opus 4.7 at $15/$75 per MTok when the real price had dropped to $5/$25. Hand-maintained tables drift.

This version loads LiteLLM's pricing database (2,739 model entries, community-maintained, covers all major providers + Bedrock/Azure/Vertex regional variants):

  • First use: HTTP fetch into `$XDG_CACHE_HOME/exo/litellm_prices.json`
  • Subsequent uses (within 24h): cached
  • After 24h: re-fetch, fall back to stale cache if network fails
  • All fails: empty table → tokens persist, `cost_usd: None` (no crash)
  • Env var overrides: `EXO_LITELLM_PRICES_PATH` (local file) and `EXO_LITELLM_PRICES_URL` (alternate source)

Cached-vs-fresh: per-provider accounting matters

Different providers report cached tokens with different conventions, and getting this wrong distorts cost by up to ~10× on cache-heavy requests:

  • Anthropic-family (anthropic, bedrock_converse, vertex_ai-anthropic_models, azure_ai): prompt_tokens is fresh input only. cache_read and cache_creation are separate. Bill all three additively.
  • OpenAI-family (openai, mistral, etc.): prompt_tokens is total (including cached). Cached is a subset. Must subtract cached from prompt before billing fresh-input rate.

The first commit got OpenAI wrong (used the additive formula universally). This version branches on LiteLLM's `litellm_provider` field and applies the correct formula. Both formulas have dedicated unit tests with realistic token mixes.

Architecture

  • `exoharness::pricing` (pure data + math, no network, stays wasm-compatible):

    • `PricingTable::from_json_str` parses LiteLLM's schema (silently ignores entries that don't have per-token rates, like image/embedding entries).
    • `PricingTable::lookup` does exact + longest-prefix match (dated revisions like `claude-sonnet-4-6-20251022` resolve to `claude-sonnet-4-6`).
    • `PricingTable::compute_cost_usd` does per-provider math.
  • `executor::pricing_loader` (network layer, gated by `tokio::sync::OnceCell`):

    • `get_pricing_table()` returns `Arc`; loads once per process, then zero-cost.
    • Resolution: `EXO_LITELLM_PRICES_PATH` → cache (fresh) → fetch → stale cache → empty.
  • `BasicExecutor::with_pricing` + `BasicHarness::with_pricing_table`: explicit-table constructors. Bypass the loader, useful for tests/embedders/air-gapped deployments.

What's in the event JSON now

```json
{
"type": "messages",
"messages": [...],
"response_id": "01J...",
"usage": {
"model": "claude-sonnet-4-6",
"prompt_tokens": 2847,
"completion_tokens": 412,
"prompt_cached_tokens": 12500,
"cost_usd": 0.0146985,
"ttft_ms": 842,
"duration_ms": 3210
}
}
```

All `usage` sub-fields are `Option` + `skip_serializing_if`. Legacy events with no `usage` key continue to deserialize.

Test plan

  • `cargo test --workspace` — 66 tests pass (was 60 on main; +11 pricing units, +1 end-to-end cost assertion, +2 backward-compat for UsageRecord JSON, -7 from refactoring the previous hardcoded-table tests away).
  • Pricing unit tests cover: Anthropic additive (no cache, cache hits, cache creation), OpenAI inclusive (with and without cache, subtraction semantics asserted), Bedrock regional surcharge, longest-prefix lookup, sample_spec doc entry skipped, unknown models, provider-style classification.
  • `pricing_loader::tests::local_path_override_is_honored` exercises the env-var override path.
  • End-to-end test `usage_record_is_persisted_with_computed_cost` uses `with_pricing_table` to inject an inline fixture — assertion is hermetic, no network dependency in CI.
  • `cargo test --package exo --test integration_chat -- --ignored` (mocked OpenAI + real local-process sandbox): unchanged, passes.
  • Loader exercised live: fetch succeeded, cache populated, subsequent calls cache-hit.

Not in scope (intentional)

  • `/cost` REPL command — persisting silently for now. UI surface is an easy follow-up.
  • Server-reported duration — lingua doesn't expose this today; field reserved for when it does.
  • Cache tier resolution — `UniversalUsage` collapses Anthropic's 5-min and 1-hour cache writes into one count. Cost defaults to the 5-minute rate. `cache_creation_input_token_cost_above_1hr` is parsed but unused.
  • Anthropic's aggregate Usage/Cost Admin API — separate org-level endpoint, aggregate-only, can't be tied to a specific message. Not useful for per-message tracking.

🤖 Generated with Claude Code

@akrentsel

Copy link
Copy Markdown
Collaborator Author

please don't review the code yet – I'm looking into a better way to get pricing details...

@akrentsel

Copy link
Copy Markdown
Collaborator Author

This is addressing #15

@akrentsel akrentsel force-pushed the feature/message-cost-tracking branch from 0de06df to 9933121 Compare May 26, 2026 22:56
@akrentsel akrentsel marked this pull request as ready for review May 27, 2026 16:31
@akrentsel akrentsel requested a review from ankrgyl May 27, 2026 16:49
@akrentsel

Copy link
Copy Markdown
Collaborator Author

@ankrgyl this is ready to review, added tests

@akrentsel

Copy link
Copy Markdown
Collaborator Author

@ankrgyl this is ready to review, added tests

@Alexsun1one Alexsun1one left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two small things on the price lookup side, otherwise this looks great. the LiteLLM-as-source-of-truth + per-provider accounting story is the right call, and i like that with_pricing keeps embedders/air-gapped runs honest. inline notes below.

Comment thread crates/exoharness/src/pricing.rs Outdated
}
self.entries
.iter()
.filter(|(key, _)| model.starts_with(key.as_str()))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefix-match without a separator boundary can fall through in surprising ways. e.g. gpt-4o-mini matches gpt-4o (intended) but also matches gpt-4 (not intended), since "gpt-4o".starts_with("gpt-4") is true; if the more specific entry is missing the longest-prefix winner is still wrong.

if the goal is the claude-sonnet-4-6-20251022 -> claude-sonnet-4-6 case, tightening the filter to require model[key.len()..] be empty or start with - / : keeps that behavior and stops gpt-4o from sliding into gpt-4 when an entry is absent.

Comment thread crates/exoharness/src/pricing.rs Outdated
// two when cached==0 (the typical case).
match provider {
Some(p) if p.starts_with("anthropic") => Self::Additive,
Some(p) if p.starts_with("bedrock") => Self::Additive,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this catches bedrock_converse correctly but also catches plain bedrock provider entries, which on Bedrock covers Mistral, Cohere, Meta Llama, AI21, and friends. those follow OpenAI-style inclusive prompt_tokens, not Anthropic-style additive.

the real-world impact is small today because cache_read on non-Anthropic Bedrock models is uncommon. but if/when those providers add caching, the formula will over-count fresh-input tokens.

probably starts_with("bedrock_converse") only, or an explicit allow-list of bedrock-anthropic provider strings.

Comment thread crates/executor/src/pricing_loader.rs Outdated

async fn try_load() -> anyhow::Result<PricingTable> {
// 1. Local path override — used by tests and air-gapped setups.
if let Ok(path) = std::env::var("EXO_LITELLM_PRICES_PATH") {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer all env vars to be parsed through clap, so that someone could propagate the pricing table as a CLI arg too. it also forces all functions (like this one) to be relatively pure

maybe we add this to a skill in the repo somewhere?

@akrentsel akrentsel force-pushed the feature/message-cost-tracking branch 3 times, most recently from a0092fc to 74c28a7 Compare June 12, 2026 05:14
akrentsel and others added 2 commits June 12, 2026 05:28
Add an optional UsageRecord to every EventData::Messages event: model id,
raw token counts (prompt / completion / cached / cache-creation /
reasoning), USD cost, and TTFT + wall-clock duration. Fields are Option +
skip_serializing_if and the record is boxed; legacy events still parse.

Cost is policy, computed in userspace, never by the trusted substrate:

- crates/cost: a standalone library with the price-table data model, a
  self-contained LiteLLM loader (explicit path/url, on-disk cache,
  degrade-to-empty), and per-provider math. Lookup is boundary-aware so
  dated revisions resolve without sliding a model onto a shorter neighbor's
  rate. Anthropic-family bills additively; everything else (including
  Bedrock, a TODO) is inclusive.
- exoharness stays minimal: it holds the UsageRecord schema and persists it
  verbatim, with no pricing code or dependency.
- Basic executor fills cost from a table loaded once at startup and injected
  via the CLI (--pricing-path / --pricing-url, env as fallback).
- The TypeScript harness (exoclaw) has its own self-contained cost port
  (@exo/model-runtime/cost) that owns its data loading (env override, own
  cache, own fetch) through the harness's normal config flow, so per-message
  cost works there with no dependency on the Rust loader or the trusted layer.

RLM is left unwired for now: its multi-call turn has different per-message
accounting and is a separate follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Write up the cost-tracking design: cost as a userspace policy library
(self-contained per language), the minimal substrate, per-provider math,
boundary-aware lookup, the loader, and the trust framing (usage is
agent-reported telemetry).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@akrentsel

Copy link
Copy Markdown
Collaborator Author

rewriting in #54

@akrentsel akrentsel closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants